A Category Resolve Power-Based Feature Selection Method

Authors

  • XU Yan
  • LI Jin-Tao
  • WANG Bin
  • SUN Chun-Ming
Abstract

One of the most important issues in Text Categorization (TC) is Feature Selection (FS). Many FS methods have been proposed and are widely used in the TC field, such as Information Gain (IG), Document Frequency (DF) thresholding, and Mutual Information (MI). Empirical studies show that IG is one of the most effective methods and that DF performs similarly, whereas MI performs relatively poorly. A basic research question is why these FS methods perform differently. Much existing work answers this question on the basis of empirical studies. This paper presents a formal study of FS based on category resolve power. First, two desirable constraints that any reasonable FS function should satisfy are defined. Then a universal method for developing FS functions is presented, and a new FS function, KG (knowledge gain), is developed using this method. Analysis shows that both IG and KG satisfy this universal method. Experiments on the Reuters-21578, NewsGroup, and OHSUMED collections show that KG and IG achieve the best performance, and that KG even outperforms IG on two of the collections. These experiments imply that the universal method is very effective and provides a formal evaluation criterion for FS methods.
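As a rough illustration of the three baseline scores named in the abstract, the following Python sketch computes DF, IG, and MI for a term over a tiny labelled corpus. It assumes the common textbook definitions (DF as the number of documents containing the term, IG as the reduction in category entropy given term presence or absence, and MI as the maximum pointwise mutual information over categories); the abstract does not state which variants the authors used. The function names and the toy corpus are hypothetical, and the paper's KG function is not shown because the abstract does not define it.

```python
import math
from collections import Counter

# Each document is a (set_of_tokens, category_label) pair. Toy data, not from the paper.
DOCS = [
    ({"wheat", "price"}, "grain"),
    ({"wheat", "export"}, "grain"),
    ({"stock", "price"}, "finance"),
    ({"stock", "market"}, "finance"),
]

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def document_frequency(term, docs):
    """DF: number of documents that contain the term."""
    return sum(1 for tokens, _ in docs if term in tokens)

def information_gain(term, docs):
    """IG: H(C) minus the entropy of C conditioned on term presence/absence."""
    n = len(docs)
    all_labels = Counter(label for _, label in docs)
    with_term = Counter(label for tokens, label in docs if term in tokens)
    without_term = Counter(label for tokens, label in docs if term not in tokens)
    n_with = sum(with_term.values())
    n_without = n - n_with
    conditional = (n_with / n) * entropy(with_term.values()) \
        + (n_without / n) * entropy(without_term.values())
    return entropy(all_labels.values()) - conditional

def max_mutual_information(term, docs):
    """MI: max over categories of log2(P(t, c) / (P(t) * P(c)))."""
    n = len(docs)
    df = document_frequency(term, docs)
    best = float("-inf")  # returned unchanged if the term never occurs
    for label, n_label in Counter(label for _, label in docs).items():
        joint = sum(1 for tokens, lab in docs if lab == label and term in tokens)
        if joint > 0:
            best = max(best, math.log2(joint * n / (df * n_label)))
    return best

if __name__ == "__main__":
    for term in ("wheat", "price"):
        print(term,
              document_frequency(term, DOCS),
              information_gain(term, DOCS),
              max_mutual_information(term, DOCS))
```

In this toy corpus "wheat" and "price" have the same document frequency, but only "wheat" separates the two categories, so a frequency-only score cannot tell them apart while IG can. This is only meant to show how the scores are computed, not to reproduce the paper's experiments.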

Similar resources

A Novel Scheme for Improving Accuracy of KNN Classification Algorithm Based on the New Weighting Technique and Stepwise Feature Selection

The K nearest neighbor (KNN) algorithm is one of the most frequently used techniques in data mining for its integrity and performance. Though the KNN algorithm is highly effective in many cases, it has some essential deficiencies, which affect the classification accuracy of the algorithm. First, the effectiveness of the algorithm is affected by redundant and irrelevant features. Furthermore, this algori...

A Real-Time Electroencephalography Classification in Emotion Assessment Based on Synthetic Statistical-Frequency Feature Extraction and Feature Selection

Purpose: To assess three main emotions (happy, sad and calm) by various classifiers, using appropriate feature extraction and feature selection. Materials and Methods: In this study a combination of Power Spectral Density and a series of statistical features are proposed as statistical-frequency features. Next, a feature selection method from pattern recognition (PR) Tools is presented to e...

A novel method based on a combination of deep learning algorithm and fuzzy intelligent functions in order to classification of power quality disturbances in power systems

Automatic classification of power quality disturbances is the foundation to deal with power quality problem. From the traditional point of view, the identification process of power quality disturbances should be divided into three independent stages: signal analysis, feature selection and classification. However, there are some inherent defects in signal analysis and the procedure of manual fe...

Feature selection using genetic algorithm for classification of schizophrenia using fMRI data

In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...

Category-Based Selection of Effective Parameters for Intrusion Detection

Existing intrusion detection techniques emphasize building an intrusion detection model based on all features provided. In feature-based intrusion detection, some selected features may be found to be redundant and useless. Feature selection can reduce the computation power requirements and model complexity. This paper proposes a category-based selection of effective parameters for intrusion detect...

Publication date: 2007